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Observed data are often contaminated by undiscovered interlopers, leading to biased parameter 
estimation. Here we present BEAMS (Bayesian Estimation Applied to Multiple Species) which sig- 
nificantly improves on the standard maximum likelihood approach in the case where the probability 
for each data point being 'pure' is known. We discuss the application of BEAMS to future Type la 
supernovae (SNIa) surveys, such as LSST, which are projected to deliver over a million supernovae 
lightcurves without spectra. The multi-band lightcurves for each candidate will provide a proba- 
bility of being la (pure) but the full sample will be significantly contaminated with other types of 
supernovae and transients. Given a sample of TV supernovae with mean probability, (P), of being 
la, BEAMS delivers parameter constraints equal to N{P) spectroscopically-confirmed SNIa. In ad- 
dition BEAMS can be simultaneously used to tease apart different families of data and to recover 
properties of the underlying distributions of those families (e.g. the Type Ibc and II distributions). 
Hence BEAMS provides a unified classification and parameter estimation methodology which may 
be useful in a diverse range of problems such as photometric redshift estimation or, indeed, any 
parameter estimation problem where contamination is an issue. 

PACS numbers: 98.80.Es 



I. INTRODUCTION 

Typically parameter estimation is performed with the 
assumption that all the data come from a single under- 
lying probability distribution with a unique dependence 
on the parameters of interest. In reality the dataset is 
invariably contaminated by data from other probability 
distributions which, left unaccounted for, will bias the 
resulting best-fit parameters. This is a typical source of 
systematic error. 

In this paper we present BEAMS (Bayesian Estima- 
tion Applied to Multiple Species), a method that allows 
for optimal parameter estimation in the face of such con- 
tamination when the probability for being from each of 
the distributions is known. As a by-product our method 
allows the properties of the contaminating distribution 
to be be recovered. 

For example, the next decade will see an explosion 
of supernova data with particular emphasis on Type la 
supernovae (SNIa) as standard candles. A few hun- 
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dred supernovae were known by 2005, see [J, H, 0, 0, 
S @, 0] and references therein. The current genera- 
tion of SNe surveys will last to around 2008 and in- 
clude SNLSp|,[<l, ESSENCE[iaj^SDSS-nG3, [H, 
CSP [IlLpJ^KAIT 0, CfA[31l|, C-T[ljand SN 
Factory[17| and will yield of order 10 3 good SNIa with 
spectra. Proposed next-generation supernova surveys in- 
clude the Dark Energy Survey [3, Pan-STARRS [H 
and SKYMAPPERdg and will deliver of order 6 x 10 4 
SNIa by 2013, the majority of which will not have spec- 
tra. Beyond this the projected ALPACA telescope [2l[ 
would find an estimated 10 5 SNIa over three years. The 
exponential data rush will culminate in the LSST [22| . 
[4l| which is expected to discover around 2 x 10 5 SNIa 
per year, yielding a catalog with over two million SNIa 
multi-colour light curves over a ten year period. The 
vast majority of these candidates will not have associ- 
ated spectra. 

Fortunatel y r ecent surveys such as HST, SNLS and 
SDSS-II dilH HI, El building on earlier work have 
convincingly shown that a probability of any object be- 
ing a SNIa can be derived from multi-colour photomet- 
ric observations of the candidate. This has become a 
very active area of research with significant recent ad- 
vances pursuing a primarily Bayesian approach to the 
problem (2(| [27], Ha Hil and suggesting that the future 



high-quality, multi-epoch lightcurves will provide accu- 
rate (i.e. relatively unbiased) probabilities of being each 
possible type of supernova (or of not being a supernova 
at all). 

However, since a less than 100% probability of being 
la is insufficient for the standard parameter estimation 
methodology, these probabilities - no matter how accu- 
rate they are - are useless and have been relegated to 
use in selecting targets for spectroscopic follow-up as it 
has always been considered imperative to obtain spectra 
of the candidates to find la's, reject interlopers and to 
obtain a rcdshift for the SNIa. 

As a result, even with the relatively small number of 
supernova candidates today it is impossible to obtain 
spectra for all good potential SNIa candidates. Instead 
only the best candidates are followed up. For LSST and 
similar telescopes less than 0.1% of likely SNIa candidates 
will be followed up spectroscopically. Unfortunately a 
spectrum for a high-z object is typically very costly to 
obtain, with the required integration time roughly scaling 
as (l + z) a with a somewhere between 2 and 6, depending 
on the specific situation. In practise the situation is more 
complex since key identifying features such as the Si II 
absorption feature at a rest frame 6150A are redshifted 
out of the optical at z ~ 0.4 requiring either infra-red ob- 
servations or higher signal-noise spectra of the remaining 
part of the spectrum. 



• True Data 

12 

v Contaminating Data 

11.5- 




8.5 L , , , , ^ 

20 40 60 80 100 

FIG. 1: Schematic illustration of the problem: data drawn 
from the true distribution (e.g. Type la supernovae) are con- 
taminated by similar looking data from a different distribu- 
tions (e.g. Type Ibc or II supernovae) leading to biasing in 
the best-fit for parameters and in their errors. 

Until now the choices available in dealing with such a 
flood of candidates were limited. Either one could limit 
oneself to those candidates with spectra, rejecting the 
vast majority of candidates, or one could imagine us- 
ing the full dataset - including the contaminating data - 
to perform parameter estimation. However, undertaking 
this in a naive way - such as simply accepting all candi- 
dates which have a probability of being a SNIa greater 
than some threshold, P* - will lead to significant biases 
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FIG. 2: Schematic illustration of the underlying distributions 
used in Figure ([T} for the true and contaminating data. In 
the case of supernovae, Type la have a much more narrow in- 
trinsic scatter in their intrinsic luminosity (narrow Gaussian) 
compared to other types of supernovae (wide Gaussian) which 
also have different means. 



and errors that will undermine the entire dataset. 

In contrast, we introduce in this paper a statisti- 
cally rigorous method for using the candidates with- 
out spectroscopic confirmation for parameter estimation. 
BEAMS offers a fully Bayesian method for appropriately 
weighting each point based on its probability of belonging 
to each underlying probability distribution (in the above 
example, its probability of being a SNIa, SNIbc, type II 
etc.). We will show that this leads to a parameter esti- 
mation method without biases (as long as the method for 
obtaining the probabilities is sound) and which improves 
significantly the constraints on (cosmological) parame- 
ters. 

We will be guided by resolving this specific problem, 
but the underlying principles and methods are more gen- 
eral and can be applied to many other cases. In order not 
to obscure the general aspects, we will skip over some 
details, leaving them for future work where actual su- 
pernova data is analysed. We will therefore assume here 
that we know the redshift of the supernovae (or of its 
host galaxy), and that we already have estimated the 
probabilities Pj that the j-th supernova is a SN-Ia (eg. 
by fitting the lightcurves with templates). 

To give a simple example, imagine that we wish to esti- 
mate a parameter 9 (which in cosmology could for exam- 
ple represent the luminosity distance to a given redshift) 
from a single data point, D, which could have come from 
one of two underlying classes (e.g. supernova Type la or 
Type II), indexed by r = A, B (with their own probabil- 
ity distributions P(8\D,t), for the parameter 6). Again 
considering SNe, the link between luminosity and the 
luminosity distance could be different for the different 
classes of supernovae due to their intrinsic distribution 
properties, so given the data D, what is the posterior 
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likelihood for 8 assuming that we also know the proba- 
bility, P T that the data point belongs to each class, r? 

Clearly, Pa = 1 — Pb since we assume the point could 
come from only one of two classes. Secondly, as P T — > 
0,1, the posterior should reduce to one or other of the 
class distributions. Hence by continuity, the posterior we 
are seeking should have the form: 

P{6\D) = f(P A )P(0\D, t = A)+ g(P B )P(0\D, t = B) 

. w 

where the continuous functions / and g have the limits 
/(0) = = 0(0) and /(l) = 1 = g{l). Since all the 
posteriors are normalised we have that J P(8\D)d9 — 
1 = J P(8\D,T)d8. We immediately find that g(P A ) = 
1 — /(l — Pa)- The simplest - and as we will show 
later, Bayesian - choice for / is simply the linear func- 
tion: /(Pa) = Pa- In this case the full posterior simply 
becomes: 

P(6\D) = P A P{0\D, t = A) + (1 - P A )P(0\D, t = B) 

(2) 

This can be easily understood: the final probability dis- 
tribution for 9 is a weighted sum of the two underlying 
probability distributions (one for each of the classes) de- 
pending on the probabilities Pa, Pb (= 1 — Pa) of belong- 
ing to each of the two classes. 

We will see that our general analysis bears this simple 
intuition out (see e.g. equation (|13I) V 

II. FORMALISM 

A. General case 

Let us derive in a rather general way the required for- 
mulae. Starting from the posterior distribution of the 
parameters, P(9\D) we can work our way towards the 
known likelihood by repeated application of the sum and 
product rules of probability theory. The crucial first step 
involves writing explicitly the marginalisation over differ- 
ent data populations, represented by a logical vector r. 
Each entry n is either A if the supernova i is of type la, 
and B if it is not. With each entry we associate a prob- 
ability Pi that t, = A, so that the probability for r, = B 
is 1 — Pi. For now we assume that these probabilities are 
known. We can then write 

P(8\D)=J2P(0,r\D) (3) 

T 

where the sum runs over all possible values of r. Using 
Bayes theorem we get 

P(9,t\D)=P(D\9,t)^^-. (4) 

The "evidence" factor P(D) is independent of both the 
parameters and r and is an overall normalisation that 
can be dropped for parameter estimation. We will fur- 
ther assume here that P(9, r) w P(9)P(t). This simplifi- 
cation assumes that the actual parameters describing our 



universe are not significantly correlated with the proba- 
bility of a given supernova to be of type la or of some 
other type. Although it is possible that there is some 
influence, we can safely neglect it given current data as 
our parameters are describing the large-scale evolution of 
the universe, while the type of supernova should mainly 
depend on local gastrophysics. In this case P(9) is the 
usual prior parameter probability, while P(t) separates 
into independent factors, 

p(r)= n p * nt 1 -^)' ( 5 ) 

Ti—A Tj=B 

Here the product over 'Vj = A" should be interpreted as 
a product over those j for which Tj — A. In other words, 
given a population vector r with entries "A" for SN-Ia 
and "_B" for other types, the total probability P(t) is 
the product over all entries, with a factor Pj if the j-th 
entry is "A" and 1 — Pj otherwise (if the j-th entry is 
"£?"). Notice that we discuss here only one given vector 
t, the uncertainty is taken care of by the outer sum over 
all possible such vectors. The full expression is therefore 

P{9\D)kP{9)Y,P{D\9,t) I] p i IT^-^)- ( 6 ) 

T Ti=A Tj=B 

The factor P(D\8, t) here is just the likelihood. In gen- 
eral we have to evaluate this expression, which is com- 
posed of 2 N terms for N supernovae. The exponential 
scaling with the number of data points means that we 
can in general not evaluate the full posterior - but it 
should be sufficient to fix = A for data points with 
Pi w 1 and tj = B for Pj 0, and to sum over the 
intermediate cases. This should give a sufficiently good 
approximation of the the actual posterior. 

B. Uncorrelated data 

In the case of uncorrelated kinds of data or mea- 
surements, such as is approximately true for supernovae 
[431, we can apply the huge computational simplification 
pointed out in [30(. In this case, the likelihood decom- 
poses into a product of independent probabilities, 

P(D\8,t) = H P(D i \9,r i =A) J] P(Dj\9,Tj = B). 

Ti—A Tj — B 

The posterior is now a sum over all possible products 
indexed by the components Tj. We can simplify it, and 
bring it into a form that lends itself more easily to the 
extensions considered in a later section, by realising that 
all binomial combinations can be generated by a product 
of sums of two terms, 

e n A i n Bj=n(A k +B k ). ^ 

T Ti=A Tj=B k 

In this schematic expression, the Ai correspond to the 
product of likelihood and prior for a = A entry, and 
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the Bj to the same product for a Tj = B entry. So 
instead of a sum over 2 N terms, we now only deal with 
N products. 

How do the A% and Bk look for our supernova ap- 
plication? Let us assume that we are dealing with two 
populations, a population A of SNe la and population B 
of non-las. For the fc-th supernova Ak is then the product 
of the probability of being type la with the likelihood 
P(Dk\0,Tk = A). But since this likelihood is conditional 
on the supernova being indeed of type la, it is just the 
normal type-la likelihood which we will call Ca^- on 
the other hand is the probability 1 — of not being type 
la times the likelihood of the supernovae that are not la, 
which we will call Cs,k- 

Ca,i is therefore the probability that the ?-th data 
point has the measured magnitude if it is type la. It 
is just the usual likelihood, typically taken as a x 2 m 
the magnitudes. With the i-th supernova data given as 
distance modulus /x, and total combined error cr, (the in- 
trinsic and measurement errors computed in quadrature) 
it is simply 

P(Di\e, n = 1) = CaM = -1 e -x?/a, (9) 

with xf = — rn(9)) 2 /of where m{9) is the theoret- 
ical distance modulus (at redshift Zi). We emphasise 
that here the normalisation of the likelihood is impor- 
tant - unlike in standard maximum likelihood parameter 
estimation - as we will be dealing with different distri- 
butions and their relative weight depends on the overall 
normalisation. In the case of SNe we can of course go a 
level deeper, since the fii are estimated from a number 
of light-curve points in multiple filters. We could start 
directly with those points as our fundamental data. Here 
we ignore this complication while noting that in an actual 
application this would be the optimal approach [44| 

The likelihood Cb.% of a non-la supernovae is harder. 
In an ideal world we would have some idea of the distri- 
bution of those supernovae, so that we can construct it 
from there (see e.g. [3ljj ) . If we do not know anything, 
we need to be careful to minimise the amount of infor- 
mation that we input. It is tempting to use an infinitely 
wide flat distribution, but such a distribution is not nor- 
malisable. Instead we can assume that the non-la points 
are offset with respect to the "good" data and have some 
dispersion. The natural distribution given the first two 
moments (the maximum entropy choice) is the normal 
(Gaussian) distribution. The potentially most elegant 
approach is to use the data itself to estimate the width 
and location of this Gaussian. This is simply done by 
allowing for a free shift b and width £ and marginalising 
over them. Optimally we should choose both parameters 
independently for each redshift bin, in the case where we 
have many supernovae per bin. Otherwise it may be best 
consider b as a relative shift with respect to the theoret- 
ical value, modelling some kind of bias. 

We would like to emphasise that our choice of the nor- 
mal distribution for the non-la points is the conservative 



choice if we want to add a minimal number of new pa- 
rameters. It does not mean that we assume it to be 
the correct distribution. In tests with a uniform and a 
X 2 type distribution for the non-la population, assum- 
ing a normal distribution sufficed to reliably remove any 
bias from the estimation process relying on the la data 
points. If we have a very large number of non-la points 
we could go beyond the normal approximation and try 
to estimate the distribution function directly, e.g. as a 
histogram. On the other hand, the more parameters we 
add, the harder it is to analyse the posterior. Also, if we 
knew the true distribution of the contaminants then we 
should of course use this information. Going back to the 
full likelihood, we now write [45| 

P(D\6,r) = ^P(A&,E|0,r) (10) 

= ^P(D|&,E,0,t)P(6,E). (11) 

6,S 

The last term is the prior on the non-la distribution. In 
the absence of any information, the conventional (least 
informative) choice is to consider the two variables as in- 
dependent, with a constant prior on b and a 1/E prior 
on the standard deviation. In reality, the sum written 
here is an integration over the two parameters, and the 
choice of prior is degenerate with the choice of integra- 
tion measure. As there are no ambiguities, we will keep 
using summation symbols throughout, even though they 
correspond to integrals for continuous parameters. 

The type-la supernovae are independent of the new 
parameters. They are only relevant for the non-la likeli- 
hood, which is now for supernova j 

P(Dj\0,b,T,,Tj = B) = £ Bj (#,&,£) 

I (Hj- m (0)-b) 2 

= _ e ^ (12) 

(in an actual application to supernova data we would 
take S to be the intrinsic dispersion of the non-la pop- 
ulation, and add to it the measurement uncertainty in 
quadrature). The posterior, Eq. ([6]), is then 

p(9\d) (x^2p(e)p(b)p(j:) x 

N 

n {£A,M p i + w^exi - Pj)} (13) 

3=1 

An easy way to implement the sum over b and £ is to in- 
clude them as normal variables in a Markov-chain montc 
carlo method, and to marginalise over them at the end. 
Additionally, their posterior distribution contains infor- 
mation about the distribution of the non-la supernovae 
that can be interesting in their own right. 
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III. A TEST-IMPLEMENTATION 

In general 6 could of course be a vector of cosmologi- 
cal parameters, but in this section we consider the simple 
case of the estimation of a constant, corresponding for ex- 
ample to the luminosity distance in a single bin for the SN 
case. Continuing with the SN example for simplicity, the 
data Di then corresponds to some mj, an apparent mag- 
nitude for each SN in a bin. We again assume that there 
are two populations, type A (corresponding to SNIa) and 
type B (everything else). 

We fix a distribution for the type A probabilities Pj, for 
simplicity we take /(Pi) oc Pj, i.e. a distribution that is 
linearly increasing so that we are dealing predominantly 
with objects of type A. We then draw a Pi from this 
distribution, and choose an actual type with that proba- 
bility. Finally, we add a "spectroscopic" sample for which 
Pi = 1, i.e. these are guaranteed to be of type A. 

We take the type A population to have a known 
Gaussian distribution with mean /ia = and variance 
o A = 0.1. The unknown distribution of type B is taken 
to be another Gaussian, with mean fis = 2 and variance 
(7b = 2. To all data points, A and B, we assign the er- 
ror bar of type A, i.e. <Ji = a a (but we fit for the error 
bar of the population B). We assume that this error has 
been derived e.g. from the dispersion of the spectroscopic 
sample and that we do not know the distribution of the 
sample B [46|. 

The parameters that are being fitted from the data 
are then /xa, and <tb, with a a fixed from the spec- 
troscopic sample, and Pi fixed for each point from an 
assumed previous step in the analysis (e.g. Pi obtained 
from goodness of fit to template lightcurves) . As a side 
remark, although a a is here assumed to be known from 
the dispersion of the spectroscopic sample, it can also be 
fitted for jointly with the other parameters, which was 
done in tests of the method [47| ; the assumption of fixed 
known Pi will be relaxed in later sections. To connect 
this highly simplified example with cosmology, we shall 
pretend that we consider here only one redshift bin, and 
that the same analysis is repeated for each bin. The value 
of /ia could then be the distance modulus fj, in one bin, 
and an unbiased estimate in all bins would then constrain 
cosmological parameters like fi m , Q\ etc. The smaller the 
errors on /i^, the better the constraints. The data from 
population B on the other hand give us no information 
on the distance modulus, hence we must reduce contam- 
ination from population B. The posterior that results 
(explicitly indicating that we estimate ha) is then 

P(HA\D,a A ) oc V — x 

^ OB 

N 

\[[P 3 CaAva,cja) + (1 - P^Cbj^b^b)}, (14) 

where the population B mean fiB and the variance <jb 
have taken over the role of the shift b and variance £ of 



Parameter Value Bias [a] 
HA -0.003 ± 0.004 0.8 
/is 2.00 ±0.11 0.0 

a B 1.90 ±0.07 1.4 



TABLE I: Example results for the basic algorithm applied to 
a sample of 10 "spectroscopic" and 1000 "photometric super- 
novae" in a bin. The bias column shows the deviation from the 
true value, in units of the standard deviation. A deviation of 
about la is expected, while about one in twenty realisations is 
more than 2a away for random data with normal distribution. 
BEAMS also allows to recover the parameters characterising 
the contaminating distribution, /is and erg. 

the last section. 

As the population B is strongly biased with respect 
to A, the algorithm needs to detect the type correctly to 
avoid wrong results. Table U shows results from an exam- 
ple run with the above parameters, 10 spectroscopic and 
1000 photometric data points, where the spectroscopic 
points are data generated in a Monte Carlo fashion from 
normally distributed population A and the photometric 
data consist of points from both population A and popu- 
lation B with associated probabilities Pj < 1. In this ta- 
ble and all following tables we add a "Bias" column that 
shows the deviation of the recovered parameters from the 
input values in units of standard deviations. 

For the spectroscopic sample the errors just scale like 
oa/VN- Each of the other supernova contributes to the 
"good" measurement with probability Pj, i.e. each data 
point has a weight Pj , or an effective error bar a a / \/~Pj 
on average. Defining the average weight 

1 N r 

w ^nH P ^ dPPf(P) = (P) (15) 
i=i 

where f(P) is the normalised probability distribution 
function of the Pj, we find that the error on /i scales 
as 



y/N s + wN ph 

for N s spectroscopic measurements (Pj = 1) and N p h 
uncertain (photometric only) measurements with average 
weight w. As can be seen in Fig. [31 the errors on \i 
recovered by the Bayesian formalism do indeed follow 
this formula, although they can be slightly worse if the 
two populations are more difficult to separate than in 
this example. 

In our example where f(Pj) oc Pj the weight is 
w = 2/3, so that three photometric supernovae equal 
two spectroscopic ones. The expected error in fj-A for the 
example of table His therefore 0.1/^10 + 2/3 x 1000 » 
0.004, in agreement with the numerical result. If we had 
used only the 10 spectroscopic data points, the error 
would have been 0.032 so that the use of all available 
information improves the result by a factor eight. In the 
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case where f(Pi) oc (1 — Pi), i.e. we are dealing pre- 
dominantly with type B data, we have a weight of 1/3. 
If it is easier to measure three photometric supernovae 
compared to one spectroscopic one, it will still be worth 
the effort in this case. We should point out here that 
these are the optimal errors achievable with the data. In 
Fig. [3] we show the actual recovered error from random 
implementations with different w and effective number of 
SNIa given by: 

N cii = N s + wN ph . (17) 

We see that the Bayesian algorithm achieves nearly op- 
timal errors (black line). 




10 100 1000 10 4 

N e „ 

FIG. 3: Scaling of the errors: The black line shows the ex- 
pected (optimal) error, which is inversely proportional to the 
effective number of SNIa given by VN e g in eq. (|17[l . The 
different colours and shapes correspond to different distribu- 
tions of the probabilities Pi (i.e. how many data points have 
Pi = 0.9, how many have Pi = 0.8, etc.). The points show the 
actually measured error for BEAMS given these distributions 
of the probabilities Pi of the data. BEAMS is able to use 
nearly all of the information available. 

We now compare the Bayesian method to some other 
possible methods: 

• Use only spectroscopic SNIa. 

• Use only SNIa with probabilities above a certain 
limiting threshold, P. A limit of 0% uses all data 
points, and a limit of 100% only the spectroscopi- 
cally confirmed points. 

• Weight the \i value for the i-th point by a function 
of Pi . This effectively corresponds to increasing the 
error for data points with lower probability. For the 
test, we use the weighting aj — ► Uj/P^ 2 . For N = 



this reverts to the limiting case where we just use 
all of the data in the usual way. For N > points 
are progressively more and more heavily penalised 
for having low probabilities. 

These ad-hoc prescriptions are not necessarily the only 
possibilities, but these were the methods we came up 
with for testing BEAMS against. We now discuss their 
application to the same test-data described above to see 
how they perform against BEAMS. 
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FIG. 4: A comparison between different methods (see text). 
Of these methods, only the BEAMS method and the use of 
the spectroscopic points alone are unbiased. As it can use 
the uncertain data, the BEAMS method improves the error 
bars in this example by the (expected) factor of 8. (For the 
error-weighted method not all values of N were plotted at the 
high-N end.) 

Figure 0] shows very clearly that although the ad- hoc 
prescriptions for dealing with the type-uncertainty can 
lead to very precise measurements, they cannot do so 
without being very biased. Both the Bayesian and the 
pure-spectroscopic approach recover the correct value 
(bias less than one a), but the latter does so at the ex- 
pense of throwing away most of the information in the 
sample. 

We can also use BEAMS to get a posterior estimate of 
the population type, based on the prior value (e.g. from 
multicolour light curves) and the distribution. To do this 
for data point j we marginalise over all entries r except 
Tj, and additionally over all estimated parameters. In 
practice this means that the j — th entry in Eq. (|14|) is 
fixed to Caj, and that we also integrate over [ia in ad- 
dition to fiB and ob- Effectively, we compute the model 
probability if the j'-th point is assumed to be of type 
A and compare it to the model probability without this 
constraint. The relative probability of the two cases then 
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tells us the posterior probability for the model vector r 
having the j-th entry equal to A, corresponding to the 
posterior probability of the j-th. supernova to be of type 
A. Fig. [5] shows an example case (using a Gaussian ap- 
proximation to evaluate the integral over all values of the 
sample mean p>a). We see how the posterior probability 
to belong to population A depends both on how well the 
location of a point agrees with the distribution of A (left) 
and on how high its prior probability was (right) . In other 
words, we can reconstruct which points came from which 
distribution from the agreement between their values of 
fi and fiA and their prior probabilities (which is indeed 
all the information at our disposal in this scenario). 

For the toy example the two distributions are quite dif- 
ferent, and BEAMS classifies all points within about 3a 
of TTi — to be of population A. The prior probability is 
here strongly overwhelmed by the data and the resulting 
posterior probabilities lie close to and 1 for most data 
points. 




-0.5 0.5 0.2 0.4 0.6 0.8 1 
8 prior probability 



FIG. 5: BEAMS as a classification algorithm: we plot the 
posterior probability of the data points to be of type A in 
our toy example. This depends on their value m.j (left panel) 
and their prior probability (right panel). The better a point 
agrees with the recovered distribution of type A, the higher 
its posterior probability to belong to it. See text for more 
information. For this test case we know the true nature of the 
points, and plot population A as red circles and population B 
as blue crosses (see Table U for distribution characteristics). . 



In the following section we extend this basic model in 
two main directions. Firstly, as reality starts to deviate 
from the model, there is a danger of introducing a bias. 
We discuss a few simple cases and try to find ways of 
hardening the analysis against the most common prob- 
lems. Secondly, we extend the model to more than two 
families, and we also discuss the possibility of using the 
information on the other populations in the analysis it- 
self. 



IV. EXTENSIONS 
A. Uncertain probabilities 

While the likelihoods used in the estimation of [ia 
(which will yield 9) are the same for the earlier example, 
in this section our treatment and use of the probabilities 
Pi differs as we begin to include possible error in the Pi's. 

Often one may not know the population probability 
Pj precisely, but has instead a probability distribution. 
For example Pj may be roughly known, but has an error 
associated with it (in the SN case this could be due to 
some systematics in the lightcurve fitting process). In 
this case we have to marginalise over all those probability 
distributions. For N supernovae this then requires an N- 
dimensional integration. It is straightforward to include 
this in a MCMC approach by allowing all Pj to be free 
variables, but with N of the order of several thousand 
it may be difficult to get a precise result. On the other 
hand this may still be better than just sampling Pj at a 
single point if it is not known exactly. 

However, if the measurements are independent, then 
each integral affects only one of the terms in the product 
over all data points in Eq. (fT3|) . Instead of one N dimen- 
sional integration we are dealing with N one-dimensional 
integrations which are much easier to compute. In gen- 
eral we have to integrate each term over the probability 
Pj with a given distribution n(pj). The case of a known 
probability corresponds to n(jpj) — S(pj — Pj). The next 
simplest example is the case of a totally unknown prob- 
ability Pj, for which = 1. In this case the integral 
to be solved in each term is 

J dpj {C A ,jPj + - Pj)) = \ ( l a,j + £b,j), (18) 

where C,A,j and Cbj are the likelihood values of the j-th 
data point assuming population A or B respectively. The 
effective probability here turns out to be Pj = 1/2. The 
reason is that we estimate this probability independently 
for each supernova, and do not have enough information 
to estimate it from the data. In the following subsection 
we replace this approach instead with a global uncertain 
probability added to the known distributions. This global 
probability can then be estimated from the data. 

For now, assume we have an approximate knowledge 
of the type-probabilities, say, an independent uncertainty 
on each Pj , Sj , so that 

■K(pj) oc e 24 j , (19) 

where the proportionality constant is chosen so that the 
integral over n(pj) from zero to one is 1. If the random 
error on Pj is small enough that the distribution function 
is well contained within the domain of integration, i.e. 
Pj + S j <C 1 and Pj — 5j 3> 0, then we recover just pj = 
Pj. In this case the Gaussian distribution function acts 
effectively as a delta- function. For large uncertainties, 
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shift of Pj fiA Bias [a] 



+0.1 


0.021 ± 0.004 


5.5 


+0.2 


0.128 + 0.004 


31.8 


+0.4 


0.408 ± 0.004 


96.8 


-0.4 


0.003 ± 0.005 


0.6 



TABLE II: Results with a systematic shift (i.e. bias) in the 
probabilities Pj. Positive shifts lead to a systematic bias in 
the results, while negative shifts lead to sub-optimal errors. 
However, the negative shifts will bias instead the inferred 
properties of population B. 

or for probabilities close to the boundaries, corrections 
will become important and can bias the result. For the 
specific case of random errors, the correction term is of 
the form Caj — £-B.j- If we suspect large random errors 
it may be worth adding this term with a global pre-factor 
of its own to the full posterior. On the other hand, in 
real applications we expect that the probabilities close to 
Pj = 1 are quite well known, so that the boundary error 
is hopefully not too important. 

A fixed, common shift is much more worrying and can 
bias the results significantly. This can be seen in table 
ILT1 where we added a systematic shift to the probabili- 
ties (enforcing < Pj < 1). This is an especially im- 
portant point for photometric supernova analyses, where 
dust reddening can bias the classification algorithm. If 
we do not take into account this possibility, then the 
analysis algorithm fails because it starts to classify the 
supernovae wrongly, but hopefully such a large bias is 
unrealistic. 

At any rate, a bias is readily dealt with by including a 
free (global) shift s into the probability factors of Eq. fTS")) 
and by marginalising over it, resulting in 

P{^a^b,o- b \D) oc 

1 N 

It may be a good idea to include such a shift and to check 
its posterior distribution. Given enough data it does not 
significantly impact the errors, and it adds stability also 
in the case of large random uncertainties in the Pj . We 
found that an additive bias with a constant prior was 
able to correct all biasing models that we looked at, as is 
shown in table Hill However, the presence of a significant 
shift would indicate a failure of the experimental setup 
and should be taken as a warning sign. 

A free individual shift is degenerate with the case of 
random uncertainties above, as it cannot be estimated 
from the data, and is not very useful in this context. 

B. Global uncertainty 

Given how critical the accuracy of the type-probability 
Pj is in order to get correct results, it may be preferable, 



shift of P 3 




Bias [a] 


recovered shift 


+0.0 


-0.003 ± 0.004 


-0.8 


0.002 + 0.011 


+0.1 


-0.004 ± 0.004 


-1.0 


0.073 ± 0.012 


+0.2 


-0.000 ± 0.004 


-0.1 


0.158 + 0.015 


+0.4 


-0.002 ± 0.004 


-0.6 


0.286 + 0.016 


-0.4 


0.004 ± 0.004 


1.0 


-0.396 + 0.013 



TABLE III: Same as table [TTJ but the model allows for a bias 
(shift) in the Pj. As most supernovae are population A, with 
f(Pj) oc Pj, the recovered shift grows slower than the input 
shift. However, it still removes any bias in the estimation of 



shift of Pj 


MA 


Bias [a] 


global probability 


+0.0 


-0.003 ± 0.004 


-0.8 


0.66 ± 0.02 


+0.1 


-0.004 + 0.004 


-0.9 


0.68 ± 0.02 


+0.2 


0.000 ± 0.004 


0.0 


0.66 ± 0.02 


+0.4 


-0.003 ± 0.004 


-0.7 


0.64 + 0.02 


-0.4 


0.004 ± 0.004 


0.9 


0.65 ± 0.02 



TABLE IV: Same as table[H] but the model uses an estimated 
global probability p for all supernovae and does not use the 
Pj (so in reality all runs above are the same). The expected 
global probability is p — N e fi/N w 0.66. 



as an additional test, to discard this information com- 
pletely. This helps to protect against wrongly classified 
outliers and the unexpected breakdown or biasing of the 
classification algorithm. 

Even if the probability for a supernova to be either of 
type la or of another type is basically unknown, corre- 
sponding to a large error on all the Pi, not all is lost. We 
can instead include a global probability p that supernovae 
belong to either of the groups, and then marginalise over 
it. In this way, the data will pick out the most likely value 
for p and which observations belong to which class. In 
terms of the posterior <| 1 3j> this amounts just to replacing 
all Pj with p and to marginalise over it, 

P(ha,Vb,o- b \D) cc 

1 N 

The prior on p, P{p), contains any knowledge that we 
have on the probability that any given supernova in our 
survey is of type la. If we do not know anything then 
a constant prior works well. As this is a global prob- 
ability (i.e. all supernovae have the same p), we can- 
not in this form include any "per supernova" knowledge 
on p, gained for example from spectra or light curves. 
For this we need to revert to the individual probabilities 
discussed previously. However, it is a good idea to in- 
clude the spectroscopic (known to be good) points with 
an explicit p = 1 as they then define which population is 
the "good" population and generally make the algorithm 
more stable. 
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In our numerical tests with the toy model described in 
section IIIII this approach works very well, see table IIVI 
However if the two distributions are difficult to separate, 
with similar average and dispersion, then the algorithm 
can no longer distinguish between them and concludes 
that the data is compatible with having been drawn from 
a single distribution with averaged properties. This does 
normally not lead to a high bias, since otherwise the data 
would have been sufficient to tease the populations apart. 
Nevertheless, it seems preferable to use the relative prob- 
abilities for the supernova types when the information is 
available and reliable. 



C. Several populations 



different populations by comparing their evidence factor, 
since by Bayes theorem (again), 

P(M\D) = P(D\M)^±. (24) 

The relative probability of models with m\ and m-2 pop- 
ulations is then 

P(D\ mi ) P( mi ) 

P(D\m s ) P(m 2 ) 1 ' 

and usually (in absence of additional information) the 
priors are taken to be P(mi) = P(w,2) so that the evi- 
dence ratio gives directly the relative probability. 



For an experiment like the SDSS supernova survey, a 
more conservative approach may be to add an additional 
population with a very wide error bar that is designed to 
catch objects that have been wrongly classified as super- 
novae, or those which got a very high la probability by 
mistake. 

Of course there is no reason to limit ourselves to two 
or three populations, given enough data. If we end up 
with several thousand supernovae per bin we can try to 
use the data itself to understand the different sub-classes 
into which the supernovae can be divided. 

The expression (fT"3|) can be straightforwardly gener- 
alised to M classes Ai of objects (for example supernova 
types) with their own means \ii and and errors <Xj as well 
as the probability for data point j to be in class Ai of P?, 

1 N ( M 1 

P( Pt ,a t \D) (x — g [] E^'^) P i • ( 22 ) 

1 li=l a i j=l { i=l J 

For each data point j the probabilities have to satisfy 
Pj = 1 . Of course there has to be at least one class 
for which the model is known, i.e. for which we know the 
connection between fj,j and the (cosmological) parame- 
ter vector 9 (the "la" class in the supernova example), 
or else it would not be possible to use this posterior for 
estimating the model parameters 9 and we end up with 
a classification algorithm instead of constraining cosmol- 
ogy 

It is possible that we even do not know how many 
different populations to expect. In this case we can just 
keep adding more populations to the analysis. We should 
then also compute the evidence factor as a function of the 
number M of populations, P(D\M), by marginalising the 
posterior of Eq. (0| over the parameters, 

P(D\M)=J2p(D\9,t)P(9,t). (23) 

This is just the integral over all ^ and ai of the "poste- 
rior" that we have used so far, Eq. (|2"2"|) , since we did not 
normalise it. Once we have computed this factor, then we 
can compare the relative probabilities of the number of 



D. Combined Formula 

What is the best way to combine the above approaches 
for future supernova surveys? There is probably no "best 
way". For the specific example of the SDSS supernova 
survey the probabilities for the different SN populations 
are derived from x 2 fits to lightcurve templates [25| . We 
expect three populations, la, Ibc and II, and objects that 
are not supernovae at all. We expect that last class to 
be very inhomogeneous, but we would like to keep the 
supernovae. From the spectroscopically confirmed su- 
pernovae we can learn what the typical goodness-of-fit of 
the templates is expected to be and so calibrate them. 
Supernovae where the x 2 of all fits is, say, 10 higher than 
for the typical spectroscopic cases are discarded. For the 
reminder we set Ki = exp(— (xo ~ X 2 )i/2) where Xo is 
the typical value for each population. If m > 1 then 
we set the probabilities to be Pj = ~Kj / 7Tj , otherwise 
Pj = it j . We also write again the more general 9 for 
the parameters of interest. 9 can represent for example 
cosmological parameters, or the luminosity distance to 
a redshift bin. The connection between 9 and the data 
is specified in the likelihoods P(Dj\6, . . .) which in gen- 
eral compare the measured magnitude to the theoretical 
value, with the theoretical value depending on the 9, in 
other words P(Dj\9,la.) = Ci a .j(9), and so on. The full 
formula then is 

P(9\D) oc p ( ) p ( b ) p (Z) x 

N 

l[{p(Dj\e,i a )p(i a )j 

3=1 

+P{Dj\6,b Ihc ,E lhc ,lbc)P(Ibc)j (26) 
+P(D 3 -|0,&n > £n,n)P(n) i 
+P(Dj\9,b x ,X x ,X) 
x(l-P(U)j-P(Ibc)j-P(II)j)} 

If on the other hand we do not trust the absolute values 
of the x 2 then we can either add a bias to safeguard 
against a systematic shift in the absolute probabilities, or 
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allow for a global Px that an object is no supernova at all. 
For this we always normalise the supernova probabilities 
to unity, Pj = Trj/X^ 71 ^) and use the likelihood 

P(0\D) oc P{d)P{b)P{Y?)P{Px) x 

b k ,T, k ,P x 
N 

Y[{[P(D,\9,l a )P(l^ 

3=1 

+P( J D i |e,6 Ibc ,Ei bc ,Ibc)P(Ibc) j (27) 

+P(D j \e,b u , Sii,H)P(I%](1 - Px) 
+P(D j \6,bx,Zx,X)P x } 

It is probably a good idea to always run an analysis 
with additional safeguards like this, and preferably a free 
global bias in the la probability, in parallel to the "real" 
analysis in case something goes very wrong. The global 
bias A might be added as 

2 

P(9\D) ex P(9) ]T P(P X ) II P(b k )P(^k)l[P(\) x 

6 fc ,S fc ,A,,P x fc6{Ibc,II,X} i=l 
N 

H{[P(D J \9M) [P(Ia),- - Ax - A 2 ] 

+P(D ] \9,b lhc ,Z lhc Jbc) [P(Ibc) j + Ai] (28) 
+P(£>j|0, 6n, En, II) [P(H)j + A 2 ] 1 (l - P x ) 



-P(D 3 \9,bx,Z x ,X)P x }, 



Especially the bias A 2 of the la vs II probability is 
useful to catch problems due to dust-reddening which 
can lead to a confusion between these two classes [32| . 

While estimating a dozen additional parameters is not 
really a problem statistically if we have several thousand 
data points, it can become a rather difficult numerical 
problem which justifies some work in itself. We are using 
a Markov-chain monte carlo code with several simulated 
annealing cycles to find the global maximum of the pos- 
terior, which seems to work reasonably well but could 
certainly be improved upon. 

We notice that in addition to a measurement of the 
model parameters 9 from the la supernovae, we also get 
estimates of the distributions of the other populations. 
In principle we could feed this information back into the 
analysis. Even though the prospect of being able to use 
the full information from all data points is very tempting, 
we may not win much from doing so. We would expect 
that the type-la supernovae are special in having a very 
small dispersion in the absolute magnitudes. As such, 
they carry a lot more information than a population with 
a larger dispersion. In terms of our toy-example where 
(T a = 0.1 and <jb = 2 we need (ctb/cta) 2 = 400 times 
more population B points to achieve the same reduction 
in the error. Unless we are lucky and discover another 
population with a very small dispersion (or a way to make 
it so), we expect that the majority of the information will 
always come from the SNIa. 



V. CONCLUSIONS 

We present a generalised Bayesian analysis formalism 
called BEAMS (Bayesian Estimation Applied to Multi- 
ple Species) that provides a robust method of parameter 
estimation from a contaminated data set when an es- 
timate of the probability of contamination is provided. 
The archetypal example we have in mind is cosmological 
parameter estimation from Type la supernovae (SNIa) 
lightcurves which will inevitably be contaminated by 
other types of supernovae. In this case lightcurve tem- 
plate analysis provides a probability of being a SNIa ver- 
sus the other types. 

We have shown that BEAMS allows for significantly 
improved estimation when compared to other estimation 
methods, which introduce biases and errors to the result- 
ing best-fit parameters. 

BEAMS applies to the case where the probability, Pj, 
of the i-th point belonging to each of the underlying dis- 
tributions is known. Where the data points are indepen- 
dent, repeated marginalisation and application of Bayes' 
theorem yields a posterior probability distribution that 
consists of a weighted sum of the underlying likelihoods 
with these probabilities. Although the general, corre- 
lated, case where the likelihood does not factor into a 
product of independent contributions is simple to write 
down, it contains a sum over 2 N terms (for 2 populations 
and N data points). This exponential scaling makes it 
unsuitable for application to real data where N is easily 
of the order of a few thousand. This case will require 
further work. 

We have studied in some detail the simple case of es- 
timating the luminosity distance in a single redshift bin 
from one population consisting of SNIa candidates and 
another of non SNIa candidates. In addition to an op- 
timal estimate of the luminosity distance, by including 
the free shift b and width £ of the wide Gaussian dis- 
tribution as variables in the MCMC estimation method, 
the BEAMS method also allows one to gain insight into 
the underlying distributions of the contaminants them- 
selves, which is not possible using standard techniques. 
Provided that the model for at least one class of data are 
known, this method can be expanded to more distribu- 
tions, each with their own shift bi and width Xj. 

BEAMS was tested against other methods, such as us- 
ing only a spectroscopically confirmed data set in a \ 2 
analysis; using only data points with probabilities higher 
than a certain cut off value, and weighting a \ 2 value by 
some function of the probability. The Bayesian method 
performs significantly better than the other methods, and 
provides optimal use of the data available. In the SNe 
la case, the Bayesian framework provides an excellent 
platform for optimising future surveys, which is specifi- 
cally valuable given the high costs involved in the spec- 
troscopic confirmation of photometric SNe candidates. 

A Bayesian analysis is optimal if the underlying model 
is the true model. Unfortunately in reality we rarely 
know what awaits us, and it is therefore a good idea 
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to add some extra freedom to the analysis, guided by 
our experience. In this way BEAMS can also be applied 
when the population probability is not known precisely. 
In this case a global uncertainty is added to the known 
probability distributions, which can be estimated from 
the data. In the case of the SNe la, one can include a 
global probability p that the supernovae belong to either 
group, and then marginalise over it, allowing the data 
to not only estimate the most likely value for p but also 
to separate the data into the two classes. This global 
approach can protect against outliers when the accuracy 
of the type-probability is not known precisely. It is one 
of the strengths of Bayesian approaches that they allow 
one to add quite general deviations from perfect data, 
which arc then automatically eliminated from the final 
result, and to compute the posterior probability that such 
surprises were present. 

A robust method of application of BEAMS to data 
from future supernova surveys is proposed to estimate 
the properties of the contaminant distributions from the 



data, and to obtain values for the desired parameters. 
Although we have illustrated and developed the BEAMS 
algorithm here with explicit references to a cosmological 
application, it is far more general. It can be easily applied 
to other fields, from photometric redshifts to other astro- 
nomical data analyses and even to other fields like e.g. 
biology. Since it is Bayesian in nature, it can very eas- 
ily be tailored to the specific needs of a subject, through 
simple and straightforward calculations. 
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from the data. In addition to these challenges, the tem- 
plate correction is usually computed using the supernova 
sample itself and may introduce some correlations. But 
in general, if the property of supernovae which makes 
them standard candles depends on the sample then we 



should be worried. So here we assume that the template 
corrections and errors were derived previously with the 
spectroscopic supernovae. Indeed, strictly speaking, we 
should use a different sample for that purpose, or else 
estimate those parameters as well as the global disper- 
sion simultaneously with the cosmological parameters. In 
the latter case it is important to keep the 1/cr normalisa- 
tion of the likelihood and to use additionally a "Jeffreys 
prior" oc 1/cr to avoid a bias towards larger dispersions. 
In the case where correlations are important then one 
must compute the full probability which is computation- 
ally intense, though systematic perturbation theory may 
be useful for small correlations. 

[44] We thank Alex Kim for pointing this out to us. 

[45] Again we stress that for the sake of clarity and generality 
we assume that we know the redshifts of the SN perfectly. 
If the redshift must also be estimated from the data then 
the formalism below must be extended in the obvious 
way. 

[46] Although this specific example considers a Gaussian dis- 
tribution for the rrii of the "non-la" population B, which 
corresponds its likelihood, we have also tested the algo- 
rithm for other distributions. 

[47] In this case its prior needs to be oc 1/cr a to avoid biases. 



